While producing a shiny dashboard application, I had the opportunity to analyse the influence of various factors affecting coffee ratings. I designed a storyboard to address my findings by means of exploration and visualisation. For this report though I hope to develop two models - classification and regression - to predict one qualitative and one quantitative measure. This is my first attempt at applying machine learning concepts to the real life dataset and I anticipate it to be an amazing learning curve.
Coffee Ratings dataset is about two species of beans - Arabica and Robusta - with different varieties, across many countries and regions, grown in low to high altitudes and with different processing methods. It was rated on a 0 - 10 scale for distinct attributes like aroma, flavour, aftertaste, acidity etc.
Buzzfeed data scientist James LeDoux collected the original data from the Coffee Quality Institute’s review pages in January 2018 and then cleaned and published on his GitHub profile as Coffee-Quality-Database. I sourced Coffee Rating dataset from tidyTuesday GitHub page published as part of their weekly dataset challenge by R4DS.
This dataset has a lot of missing values and repetition of variables in some of the columns. This should not restrict my ability to find the right answers as there are other key variables with minimal or no missing values.
This report is composed of five sections:
How accurately variety can be predicted using the variables - total_cup_points, species, country_of_origin, processing_method, color, moisture and altitude_mean_meters?
What are the best variables to predict total cup points with great accuracy?
The Coffee Ratings dataset were originally collected from the Coffee Quality Institute’s review pages in January 2018. The data was then published on Buzzfeed Data Scientist James LeDoux’s github as Coffee-Quality-Database in both raw and cleaned form. It was released on 7 July 2020, as part of their weekly dataset challenge in the Tidy Tuesday Project by R4DS.
This data records two species of coffee beans (Arabica and Robusta) by various attributes like variety, processing methods, colour and scores given by professional graders for aroma, flavour, aftertastes, acidity etc. It also includes metadata like owner, country of origin, producer, harvest year among others.
The dataset has 1339 observations and 43 features. The features include character and numeric variables. Between the two species of coffee beans - Robusta only has 28 observations, rest are all Arabica. Data was collected for the harvest year 2009-18 and grading is available for the same period on a yearly basis. Although it has various metadata columns, lot of missing values makes few of those columns almost redundant to work with. Coffee grading variables ‘uniformity’, ‘clean_cup’, and ‘sweetness’ have perfect 10 score for more than 90% of their observations. Altitude has four separate columns with various quantification but it seems ‘altitude_mean_meters’ would be more suitable to employ in the models due to the least number of values missing and mean measurement is likely to provide better prediction accuracy.
library(tidyverse)
library(dplyr)
library(visdat)
library(rsample)
library(randomForest)
library(caret)
library(modelr)
library(forecast)
library(broom)
library(ggplot2)
library(plotly)
library(wesanderson)
coffee_ratings <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-07/coffee_ratings.csv')
Although James LeDoux transformed raw data into more usable format, there are still some cleaning to be done to improve data quality and overall productivity. I will start with drawing a vis_miss plot from ‘naniar’ package to highlight the missing observations for each column.
vis_miss(coffee_ratings) +
theme(axis.text.x = element_text(size = 8, angle = 90))
As you can see values missing mostly in farm_name, lot_number, mill, ico_number, company, altitude and producer. Excluding these variables will not have much impact on developing our models. For the remaining variables, decided not to impute over missing values considering there are very few of the missing values in those columns. Instead, I will use drop_na to drop rows containing missing observations.
coffee_ratings <- coffee_ratings %>%
select(total_cup_points, species, country_of_origin, processing_method, variety,
aroma:cupper_points, moisture, color, altitude_mean_meters) %>%
drop_na()
Now I will make use of R base boxplot to visualise the distribution of numerical data and skewness.
coffee_ratings %>%
select_if(is.numeric) %>%
boxplot(las = 2, xlab = "", ylab = "")
It seems data is evenly spread for all the variables except for ‘altitude_mean_meter’. I will remove these extreme values by filtering the dataset and keep the rest to ensure data integrity.
coffee_ratings <- coffee_ratings %>%
filter(altitude_mean_meters < 3500)
With many machine learning algorithms available, I found it challenging to choose the most appropriate algorithm for finding answers to my research questions. I decided to follow a methodical decision making process to select the best model.
In essence, supervised learning uses labelled input and output data, while an unsupervised learning algorithms analyse and cluster unlabeled data sets. Coffee Ratings dataset is perfectly labeled, it’s input can be fed into the model along with the output. The model can learn from the training dataset and then evaluate the accuracy of the learned function on a test set.
Supervised learning algorithms can be split into two categories: classification and regression. Classification algorithms are used to predict categorical/discreet values, while regression algorithms predict continuous values. I will have the opportunity to use both for my research questions.
How accurately variety can be predicted using the variables - total_cup_points, species, country_of_origin, processing_method, color, moisture and altitude_mean_meters?
The output variable - variety - is categorical with 25 distinct categories like Caturra, Bourbon, Typica etc. Therefore, a classification model provided with several input variables can predict the variety category. Examples of the common classification algorithms include logistic regression, decision trees, Naive Bayes, and K-nearest neighbours.
Following the consideration of logistic regression, K-nearest neighbours and random forest for classification, I became more biased towards developing a random forest model:
What are the best variables to predict total cup points with great accuracy?
The outcome or response variable - total_cup_points - is a numeric column which consists of continuous values. Hence, a regression algorithm would be best suited to predict total cup points. The different types of regression algorithms include simple linear regression, multiple linear regression and polynomial regression.
Considering we have two or more independent variables and linear relationship exists between dependent variable (total_cup_points) and the predictors, multiple linear regression (MLR) would be useful in this scenario:
Before starting to build the models, I want to examine if ‘total_cup_points’ are aggregate of the variables ‘aroma:cupper_points’. I can use ‘pivot_longer’ function to validate my assumption.
coffee_ratings %>%
select(total_cup_points, aroma:cupper_points) %>%
mutate(row_num = row_number()) %>%
pivot_longer(aroma:cupper_points, names_to = "name", values_to = "value") %>%
group_by(row_num) %>%
summarise(total_cup_points = mean(total_cup_points),
value = sum(value))
## # A tibble: 898 × 3
## row_num total_cup_points value
## <int> <dbl> <dbl>
## 1 1 89.9 89.9
## 2 2 88.8 88.8
## 3 3 88.2 88.2
## 4 4 87.2 87.3
## 5 5 87.2 87.2
## 6 6 87.2 87.2
## 7 7 87.2 87.2
## 8 8 86.9 86.9
## 9 9 86.8 86.9
## 10 10 86.6 86.6
## # … with 888 more rows
This being the case I have chosen only to include 2 columns from aroma:cupper_points for the regression algorithm. The reason being using more of these predictors will lead to the model ignoring all other predictors.
In order to implement into the modeling correctly and to ensure more efficient use of memory, I will convert character variables to factors.
# converting character variables to factor
coffee_factored <- coffee_ratings %>%
mutate(species = as.factor(species)) %>%
mutate(country_of_origin = as.factor(country_of_origin)) %>%
mutate(processing_method = as.factor(processing_method)) %>%
mutate(color = as.factor(color))
How accurately variety can be predicted using the feature variables - total_cup_points, species, country_of_origin, processing_method, color, moisture and altitude_mean_meters?
It would be nice to consider all 25 different varieties for this model, but to get better prediction outcome we will select only the top 6 varieties by number of observations, which accounts for about 80% of the data.
top_variety <- coffee_factored %>%
count(variety = factor(variety)) %>%
mutate(pct = prop.table(n)) %>%
arrange(-n)
top_variety
## # A tibble: 24 × 3
## variety n pct
## <fct> <int> <dbl>
## 1 Caturra 217 0.242
## 2 Bourbon 197 0.219
## 3 Typica 182 0.203
## 4 Other 83 0.0924
## 5 Catuai 65 0.0724
## 6 Yellow Bourbon 30 0.0334
## 7 Mundo Novo 25 0.0278
## 8 Catimor 19 0.0212
## 9 SL14 16 0.0178
## 10 SL28 13 0.0145
## # … with 14 more rows
coffee_select_variety<- coffee_factored %>%
filter(variety %in% c("Bourbon", "Catimor", "Catuai", "Caturra", "Typica", "Yellow Bourbon"))
To accurately evaluate our model and to prevent overfitting, I split the data 70/30 into train and test sets.
# split data into training and test sets
set.seed(123)
coffee_split <- initial_split(coffee_select_variety, prop = .7)
coffee_train <- training(coffee_split)
coffee_test <- testing(coffee_split)
coffee_train$variety <- as_factor(coffee_train$variety)
Now, I will create a random forest model with default parameters. By default, number of trees is 500 and number of variables tried at each split is 2 as classification trees have a default setting of the square root of the number of variable. Model will make use of the ‘variety’ as a function of ‘total_cup_points’, ‘species’, ‘country_of_origin’, ‘processing_method’, ‘moisture’, ‘color’ and ‘altitude_mean_meters’ using ‘coffee_train’ data. This is a classification problem as ‘variety’ is a factor variable. Later I will check to see if 500 trees is enough for optimal classification.
# fit random forest to the training data
model1 <- randomForest(variety ~ total_cup_points + species + country_of_origin + processing_method + moisture + color + altitude_mean_meters, data = coffee_train)
model1
##
## Call:
## randomForest(formula = variety ~ total_cup_points + species + country_of_origin + processing_method + moisture + color + altitude_mean_meters, data = coffee_train)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 25.6%
## Confusion matrix:
## Caturra Yellow Bourbon Bourbon Catuai Typica Catimor class.error
## Caturra 105 0 20 3 17 1 0.28082192
## Yellow Bourbon 0 11 7 0 3 0 0.47619048
## Bourbon 4 5 109 2 20 0 0.22142857
## Catuai 13 1 15 15 3 0 0.68085106
## Typica 4 0 5 0 123 0 0.06818182
## Catimor 3 0 0 0 1 6 0.40000000
Model summary shows OOB error estimate of 25.6%, means that 74.4% of the OOB samples were correctly classified by the random forest model. Then we have a confusion matrix, which shows how many variety of the beans correctly labeled or otherwise for each variety. For instance, there were 105 Caturra variety that were classified correctly but 13 were classified incorrectly as Catuai.
Next up I will do the prediction on training data by specifying the model created.
# predict the training set result
pred_coffee_train <- predict(model1, coffee_train)
matrix_train <- confusionMatrix(pred_coffee_train, coffee_train$variety)
matrix_train$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 9.375000e-01 9.166052e-01 9.124564e-01 9.571429e-01 2.943548e-01
## AccuracyPValue McnemarPValue
## 3.737945e-203 NaN
This model has an accuracy score of 93.75% on the training data. That is quite impressive but expected since training data already been seen by the model. More important measure would be the accuracy score for the test data.
# predict the testing set result
coffee_test$variety <- as_factor(coffee_test$variety)
pred_coffee_test <- predict(model1, coffee_test)
matrix_test <- confusionMatrix(pred_coffee_test, coffee_test$variety)
matrix_test$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 7.570093e-01 6.766150e-01 6.938695e-01 8.128812e-01 3.317757e-01
## AccuracyPValue McnemarPValue
## 4.839898e-37 NaN
We see accuracy dropped down to 75.70% for the test data. This is more accurate assessment of the accuracy of our model.
Now lets draw a ggplot to look at the error rate in the model.
oob.error.data <- data.frame(
Trees = rep(1:nrow(model1$err.rate), times = 7),
Type = rep(c("OOB", "Caturra", "Catuai", "Bourbon", "Typica", "Yellow Bourbon", "Catimor"), each = nrow(model1$err.rate)),
Error = c(model1$err.rate[,"OOB"],
model1$err.rate[,"Caturra"],
model1$err.rate[,"Catuai"],
model1$err.rate[,"Bourbon"],
model1$err.rate[,"Typica"],
model1$err.rate[,"Yellow Bourbon"],
model1$err.rate[,"Catimor"]))
error_plot <- ggplot(data = oob.error.data, aes(x = Trees, y = Error)) +
geom_line(aes(color = Type)) +
theme_bw() +
scale_color_manual(values = c("#999999", "#E69F00", "#56B4E9", "#009E73",
"#F0E442", "#0072B2", "#D55E00", "#CC79A7")) +
labs(title = "Distribution of error") +
theme(legend.position = "bottom") +
theme(legend.title = element_blank())
interactive_error_plot <- ggplotly(error_plot)
interactive_error_plot
Here, each line represents error rate for classifying different variety and the yellow line shows the overall OOB error rate. We see the error rates decrease when our random forest has more trees. It can be tweaked by adding more trees to see if error rates will go down further.
As one of the best “out-of-the-box” machine learning algorithm, random forests perform exceptionally well with very little tuning required. Even so we can try to tune the parameters of our model for better accuracy.
coffee_train <- as.data.frame(coffee_train)
tuneRF(coffee_train[,-5], coffee_train[,5],
stepFactor = 0.5,
plot = TRUE,
ntreeTry = 500,
trace = TRUE,
improve = 0.05)
OOB Error was initially quite high with mtry at 1 and then it reaches the bottom with mtry at 8. This gives us an idea of which mtry value to choose.
Now we can refine our model by replacing the values for ntree and mtry, also adding few more parameters.
model1_refined <- randomForest(variety ~ total_cup_points + species + country_of_origin + processing_method + moisture + color + altitude_mean_meters, data = coffee_train, ntree = 540, mtry = 8, importance = TRUE, proximity = TRUE)
model1_refined
##
## Call:
## randomForest(formula = variety ~ total_cup_points + species + country_of_origin + processing_method + moisture + color + altitude_mean_meters, data = coffee_train, ntree = 540, mtry = 8, importance = TRUE, proximity = TRUE)
## Type of random forest: classification
## Number of trees: 540
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 27.42%
## Confusion matrix:
## Caturra Yellow Bourbon Bourbon Catuai Typica Catimor class.error
## Caturra 104 0 17 8 14 3 0.2876712
## Yellow Bourbon 0 14 4 0 3 0 0.3333333
## Bourbon 14 5 100 4 17 0 0.2857143
## Catuai 12 0 10 22 3 0 0.5319149
## Typica 11 0 7 0 113 1 0.1439394
## Catimor 3 0 0 0 0 7 0.3000000
There are 540 trees used in the model with mtry of 8. OOB error rate slightly climbed to 27.42% from earlier derived rate of 25.6%.
# predict the training set result
pred_coffee_train1 <- predict(model1_refined, coffee_train)
matrix_train1 <- confusionMatrix(pred_coffee_train1, coffee_train$variety)
matrix_train1$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 1.000000e+00 1.000000e+00 9.925903e-01 1.000000e+00 2.943548e-01
## AccuracyPValue McnemarPValue
## 3.631658e-264 NaN
Training accuracy increased to 100% from 93.75% but again the real test would be to find how model perform on test data.
# predict the testing set result
pred_coffee_test1 <- predict(model1_refined, coffee_test)
matrix_test1 <- confusionMatrix(pred_coffee_test1, coffee_test$variety)
matrix_test1$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 7.663551e-01 6.908051e-01 7.038400e-01 8.213331e-01 3.317757e-01
## AccuracyPValue McnemarPValue
## 1.172681e-38 NaN
Accuracy score of the model on the test data increased to 76.63% from the score achieved (75.70%) in the first attempt. So overall the model performs well to predict variety.
What are the best variables to predict total cup points with great accuracy?
To start with I will plot the selected variables to find out the relationship between them.
coffee_ratings1 <- coffee_factored %>%
select(total_cup_points, species, processing_method, flavor, cupper_points, moisture, altitude_mean_meters)
plot(coffee_ratings1)
From the above plot we can see total cup points are strongly correlated with both flavor and cupper points but not so much with the other variables. Also flavor and cupper points are correlated - this means that they provide similar information and we might not need both in our model. For this model though we will go ahead with using the variables ‘species’, ‘processing_method’, ‘flavor’, ‘cupper_points’, ‘moisture’ and ‘altitude_mean_meters’. Later we will utilise backward prediction to shortlist the best predictor variables.
# split data into training and test sets
set.seed(123)
coffee_split <- initial_split(coffee_ratings1, prop = 0.7)
coffee_train <- training(coffee_split)
coffee_test <- testing(coffee_split)
After splitting the data into training and test set we can use lm() function to fit a plane to the training data.
model2 <- lm(total_cup_points ~ species + processing_method + flavor + cupper_points + moisture + altitude_mean_meters, data = coffee_train)
summary(model2)
##
## Call:
## lm(formula = total_cup_points ~ species + processing_method +
## flavor + cupper_points + moisture + altitude_mean_meters,
## data = coffee_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.6604 -0.3099 0.1588 0.5701 4.0413
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 29.9657537 1.1854443 25.278
## speciesRobusta -2.5410844 0.8221352 -3.091
## processing_methodOther -0.7222348 0.3627558 -1.991
## processing_methodPulped natural / honey 0.5143378 0.3961015 1.299
## processing_methodSemi-washed / Semi-pulped 0.1383338 0.2395525 0.577
## processing_methodWashed / Wet 0.1862124 0.1287329 1.447
## flavor 4.1866923 0.2495902 16.774
## cupper_points 2.7448343 0.2119541 12.950
## moisture -1.2129923 1.1303863 -1.073
## altitude_mean_meters 0.0001550 0.0001118 1.386
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## speciesRobusta 0.00209 **
## processing_methodOther 0.04692 *
## processing_methodPulped natural / honey 0.19460
## processing_methodSemi-washed / Semi-pulped 0.56383
## processing_methodWashed / Wet 0.14854
## flavor < 2e-16 ***
## cupper_points < 2e-16 ***
## moisture 0.28366
## altitude_mean_meters 0.16628
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.14 on 618 degrees of freedom
## Multiple R-squared: 0.7878, Adjusted R-squared: 0.7847
## F-statistic: 254.9 on 9 and 618 DF, p-value: < 2.2e-16
From the above summary we see the p-value of <0.05 for ‘speciesRobusta’, ‘flavor’ and ‘cupper_points’ which means these variables are contributing significantly to the model. On the contrary ‘processing_methodOther’, ‘processing_methodSemi-washed / Semi-pulped’ and ‘moisture’ have very high p-values. Multiple R-squared of 0.7727 means that variables in the model are contributing 77.27% of overall variability, so there is a room for improvement by adding other variables. F-statistic of 236 indicates R-squared is significant and p-value of <0.05 at the bottom says that variables used in the model gives us the reliable estimate of ‘total_cup_points’.
glance(model2)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.788 0.785 1.14 255. 1.80e-201 9 -969. 1959. 2008.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
We can also look at Akaike information criterion (AIC) and Bayesian information criterion (BIC) to further assess the quality of our model. Here, AIC and BIC have very high values, models with lower values are considered for better quality. Next we will apply Backward Elimination to reduce the number of predictors and reducing the multicollinearity problem. It can also resolve the model overfitting.
model2_step_b <- step(model2, direction = "backward")
## Start: AIC=174.91
## total_cup_points ~ species + processing_method + flavor + cupper_points +
## moisture + altitude_mean_meters
##
## Df Sum of Sq RSS AIC
## - moisture 1 1.50 805.18 174.08
## - altitude_mean_meters 1 2.50 806.18 174.86
## <none> 803.69 174.91
## - processing_method 4 12.17 815.86 176.35
## - species 1 12.42 816.11 182.54
## - cupper_points 1 218.10 1021.78 323.69
## - flavor 1 365.92 1169.60 408.54
##
## Step: AIC=174.08
## total_cup_points ~ species + processing_method + flavor + cupper_points +
## altitude_mean_meters
##
## Df Sum of Sq RSS AIC
## - altitude_mean_meters 1 2.22 807.40 173.81
## <none> 805.18 174.08
## - processing_method 4 12.04 817.22 175.39
## - species 1 11.98 817.16 181.35
## - cupper_points 1 227.76 1032.94 328.51
## - flavor 1 364.99 1170.18 406.85
##
## Step: AIC=173.81
## total_cup_points ~ species + processing_method + flavor + cupper_points
##
## Df Sum of Sq RSS AIC
## <none> 807.40 173.81
## - processing_method 4 14.55 821.95 177.02
## - species 1 10.88 818.28 180.21
## - cupper_points 1 232.23 1039.63 330.56
## - flavor 1 367.12 1174.52 407.17
Backward elimination started with AIC of 243.08 and 6 features. After 4 elimination phases, AIC value reduced to 235.94 and only 3 variables remaining - ‘species’, ‘cupper_points’ and ‘flavor’.
summary(model2_step_b)
##
## Call:
## lm(formula = total_cup_points ~ species + processing_method +
## flavor + cupper_points, data = coffee_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.6370 -0.2955 0.1745 0.5767 4.1672
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.6040 1.1537 25.659 < 2e-16
## speciesRobusta -2.3547 0.8147 -2.890 0.00398
## processing_methodOther -0.7148 0.3627 -1.971 0.04922
## processing_methodPulped natural / honey 0.5302 0.3962 1.338 0.18132
## processing_methodSemi-washed / Semi-pulped 0.1193 0.2391 0.499 0.61802
## processing_methodWashed / Wet 0.2284 0.1229 1.857 0.06374
## flavor 4.1903 0.2496 16.790 < 2e-16
## cupper_points 2.7967 0.2094 13.354 < 2e-16
##
## (Intercept) ***
## speciesRobusta **
## processing_methodOther *
## processing_methodPulped natural / honey
## processing_methodSemi-washed / Semi-pulped
## processing_methodWashed / Wet .
## flavor ***
## cupper_points ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.141 on 620 degrees of freedom
## Multiple R-squared: 0.7868, Adjusted R-squared: 0.7844
## F-statistic: 326.9 on 7 and 620 DF, p-value: < 2.2e-16
Stepwise model summary displays significant improvement in F-statistic value along with overall p-value of <0.05. Larger F-statistic value suggests that stepwise model provides a better “goodness-of-fit”.
wp <- wes_palettes$Rushmore1
wp2 <- wes_palettes$BottleRocket1
resid_plot <- ggplot(model2_step_b, aes(.fitted, .resid)) +
geom_ref_line(h = 0) +
geom_point(alpha = 0.6, color = wp2[5]) +
geom_smooth(se = FALSE, color = wp[4]) +
theme_bw() +
ggtitle("Residuals vs Fitted")
interactive_resid_plot <- ggplotly(resid_plot)
interactive_resid_plot
Residuals vs fitted plot shows residuals are spread around the ‘0’ line. This suggests the assumptions that the relationship is linear is reasonable. There are few residual distanced from the basic random pattern of residuals, which suggests that some outliers are present in the data. Model derived from backward elimination does not seem to suffer from heteroscedasticity.
Now we can use predict() function to check the accuracy on the test data.
model2.pred <- predict(model2_step_b, coffee_test)
For each of the observations in the test data, we can compare predicted outcome y^ and the actual value y to obtain the residuals.
some.residuals <- coffee_test$total_cup_points - model2.pred
resid.data <- data.frame("Predicted" = model2.pred,
"Actual" = coffee_test$total_cup_points,
"Residual" = some.residuals)
resid_plot <- ggplot(resid.data, aes(x=some.residuals)) +
geom_histogram() +
theme_bw() +
scale_fill_brewer(palette = "Set2") +
labs(title = "Residual Histogram")
interactive_resid_plot <- ggplotly(resid_plot)
interactive_resid_plot
Above histogram shows an approximately normal distribution of residuals with most of the errors are between (-2, 2). But there are few outliers. This error magnitude might be small relative to the total cup points but should be taken into account when making predictions.
Random Forest
Random forest model achieving accuracy of 100% on training data indicates either the model is overfitted or algorithm learned exceptionally well to predict variety with precision.
Test classification accuracy of 76.63% achieved by our model further proves that the model is overfitted due to the large variation in accuracy rate between training and test set. Perhaps additional tuning of the hyperparameters will accomplish better outcome.
Sample size is probably not large enough for this kind of machine learning algorithm. Moreover, having additional features like region, producer as predictors can be an effective mechanism in improving the model.
Even though accuracy is the best measure to find the effectiveness of the model, there are other measurements that can be applied - like sensitivity and specificity to evaluate model performance.
Out-of-Bag error (OOB) rate of 27.42% - means 72.58% of the OOB samples were correctly classified by the random forest - which is different from the test accuracy of 76.63%. OOB error signifies that for each bootstrap iteration and related trees, prediction error using data not in bootstrap sample is estimated. Whereas test accuracy score is how well the model actually performed on test data.
Overall the model performs well but to get more nuanced understanding of the model we need to explore other summary metrics like confusion matrix, precision, recall and F1 score.
Multiple Linear Regression
Multiple R-squared of 0.7868 along with the p-value of <0.05 implies predictor variables are relatively accurate in predicting total cup points. Although R-squared can be improved by adding more predictor variables in the regression model.
Initial F-statistic value of 254.9 increased to 326.9 after backward elimination means more of the variation per parameter is explained by the model.
Reduced AIC of 173.81 achieved by means of removing response variables with weak correlation will ensure better model performance.
As illustrated in the histogram, residuals are normally distributed validates the assumption of running a linear model. It also suggests that model has a good fit, and reliable for prediction.
Coffee Ratings. (2022). https://nayeembhuiyan.shinyapps.io/Coffee-rating/
Plotly. (2021). Plotly R open source graphing library. https://plotly.com/r/
rfordatascience/tidytuesday GitHub. (2022). Coffee rating. https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-07-07/readme.md
R Studio. (2021). RStudio cheatsheets. https://www.rstudio.com/resources/cheatsheets/
Supervised vs. Unsupervised Learning: What’s the Difference? (2021). https://www.ibm.com/cloud/blog/supervised-vs-unsupervised-learning
Tierney, N. (2021). Getting started with naniar. http://naniar.njtierney.com/articles/getting-started-w-naniar.html#introduction
Tierney, N. (2020). RMarkdown for scientists. https://rmd4sci.njtierney.com/
UC Business Analytics R Programming Guide. Linear Regression. http://uc-r.github.io/linear_regression
UC Business Analytics R Programming Guide. Random Forests. http://uc-r.github.io/random_forests
Wickham, H. & Grolemund, G. (2017). R for data science. O’Reilly Media , Inc..